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1. Introduction 

We discuss applications of stochastic smoothing results to the design of efficient first-order methods for 
solving semidefinite programs. We focus here on the problem of minimizing the maximum eigenvalue of a 
matrix over a simple convex set Q (the meaning of simple will be made precise later), i.e. solve 



ABSTRACT. We use a rank one Gaussian perturbation to derive a smooth stochastic approximation of the 
maximum eigenvalue function. We then combine this smoothing result with an optimal smooth stochastic 
optimization algorithm to produce an efficient method for solving maximum eigenvalue minimization problems. 
We show that the complexity of this new method is lower than that of deterministic smoothing algorithms in 
' certain precision/dimension regimes. 
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in the variable X £ S n . Note that all primal semidefinite programs with fixed trace have a dual which can 
be written in this form. While moderately sized problem instances are solved very efficiently by interior 
point methods [Ben-Tal and Nemirovski, 2001] with very high precision guarantees, these methods fail on 
most large-scale problems because the cost of running even one iteration becomes too high. When coarser 
precision targets are sufficient (e.g. spectral methods in statistical or geometric applications), much larger 
problems can be solved using first-order algorithms, which tradeoff a lower cost per iteration in exchange 
^ ■ for a degraded dependence on the target precision. 

So far, roughly two classes of first-order algorithms have been used to solve large-scale instances of the 
| semidefinite program in (1). The first uses subgradient descent or a variant of the mirror-prox algorithm of 

[Nemirovskii and Yudin, 1979] that takes advantage of the geometry of Q to directly minimize A max (X). 
These methods do not exploit the particular structure of problem (1) and need 0(Dg/e 2 ) iterations to reach 
a target precision e, where Dq is the diameter of the set Q. Each iteration requires computing a leading 
eigenvector of the matrix X at a cost of roughly 0(n 2 logn) and projecting X on Q at a cost written pq. 
^ | Spectral bundle methods [Helmberg and Rendl, 2000] use more information on the spectrum of X to speed 

up convergence, but their complexity is not well understood. More recently, [Nesterov, 2007a] showed that 
one could exploit the particular min-max structure of problem (1) by first regularizing the objective (using a 
"soft-max" exponential smoothing), then using optimal first-order methods for smooth convex minimization. 
These algorithms only require 0{y/\ogn/ e) iterations, but each iteration forms a matrix exponential at a cost 
of 0(n 3 ). In other words, depending on problem size and precision targets, existing first-order algorithms 
offer a choice between two complexity bounds 

Q^ (n2l °j n + ^ and (gQV^!±M) (2) 

Note that the constants in front of all these estimates can be quite large and actual numerical complexity 
depends heavily on the particular path taken by the algorithm, especially for adaptive variants of the methods 
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detailed here (see [Nesterov, 2007b, §6] for an illustration on a simpler problem). In practice of course, these 
asymptotic worst case bounds are useful for providing general guidance in algorithmic choices, but remain 
relatively coarse predictors of performance for reasonable values of n and e. 

Many recent works have sought to move beyond these two basic complexity options. Overton and Wom- 
ersley [1995] directly applied Newton's method to the maximum eigenvalue function, given a priori infor- 
mation on the multiplicity of this eigenvalue. Burer and Monteiro [2003] and Journee et al. [2008] focus on 
instances where the solution is known to have low rank (e.g. matrix completion, combinatorial relaxations) 
and solve the problem directly over the set of low rank matrices. These formulations are nonconvex and their 
complexity cannot be explicitly bounded, but empirical performance is often very good. Lu et al. [2007] fo- 
cus on the case where the matrix has a natural structure (close to block diagonal). Juditsky et al. [2008] use a 
variational inequality formulation and randomized linear algebra to reduce the cost per iteration of first-order 
algorithms. Subsampling techniques were also used in [d'Aspremont, 2011] to reduce the cost per iteration 
of stochastic averaging algorithms. Finally, in results that are similar, Baes et al. [2011] use stochastic ap- 
proximations of the matrix exponential to reduce the cost per iteration of smooth first-order methods. The 
complexity tradeoff and algorithms in this last result are different from ours (roughly speaking, a 1/e term 
is substituted to the ^fn term in our bound), but both methods seek to reduce the cost of smooth first-order 
algorithms for semidefinite programming using stochastic gradient oracles instead of deterministic ones. 

In this paper, we use stochastic smoothing results, combined with an optimal accelerated algorithm for 
stochastic optimization recently developed by Lan [2009], to derive a stochastic algorithm for solving (1) 
which requires only 0(\/n/e) iterations, with each iteration computing a few sample leading eigenvectors 
of (X + ezz T /n) where z ~ AA(0,I n ). While in most applications of stochastic optimization the noise 
level is seen as exogenous, we use it in the algorithm detailed here to control the tradeoff between number 
of iterations and cost per iteration. The algorithm requires fewer iterations than nonsmooth methods and has 
lower cost per iteration than smoothing techniques. In some configurations of the parameters (n, 6,pq,Dq), 
its total worst-case floating-point complexity is lower than that of both smooth and nonsmooth methods. 
Overall, the method has a cost per iteration comparable to that of nonsmooth methods while retaining some 
of the benefits of accelerated methods for smooth optimization. 

The paper is organized as follows. In the next section, be briefly outline our stochastic smoothing al- 
gorithm for maximum eigenvalue minimization and compare its complexity with existing first-order algo- 
rithms. Section 3 details our main smoothing results on random rank one perturbations of the maximum 
eigenvalue function, highlighting in particular a phase transition in the spectral gap depending on the spec- 
trum of the original matrix. Section 4 uses these smoothing results to produce a stochastic algorithm for 
maximum eigenvalue minimization, and describes an extension of the optimal stochastic optimization al- 
gorithm in [Lan, 2009] where the scale of the step size is allowed to vary adaptively (but monotonically). 
Section 5 informally discusses extensions of our results to other smoothing techniques, together with their 
impact on complexity. Section 6 presents some preliminary numerical experiments. An appendix contains 
auxiliary material, including a detailed discussion of the cost of computing leading eigenpairs of a symmet- 
ric matrix and a proof of the phase transition result for random rank-one perturbations. 

Notation. Throughout the paper, we denote by A,(X) the eigenvalues of the matrix X G S n , in decreasing 
order. For clarity, we will also use A max (X) for the leading eigenvalue of X. When z denotes a vector in 

M. n , its i-th coordinate is denoted by Zj. We denote equality in law (for random variables) by = and =^ 
stands for convergence in law. 



2. Stochastic smoothing algorithm 

We will solve a smooth approximation of problem (1), written 

minimize f(X) = E [max^i,...^ A max (X + ^zj)] 
subject to X e Q, 



in the variable X G S n , where Q C M. n is a compact convex set, Zj are i.i.d. Gaussian vectors Zi ~ AA(0, I n ) 
and k > is a small constant (typically 3). We call /* the optimal value of this problem. The fact that 
A max (-) is 1-Lipschitz with respect to the spectral norm with \ ma x( z i z T) = yields 



E 



max A 
i=l,...,fc 



max I X + Zj,Z, 
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1=1,..., A; 

depends only on k. Jensen's inequality and EfzjZ^ 
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i=l,...,fe 
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i=l,...,fc V n 



This means that /(X) will be a c^e-uniform approximation of A max (X). We begin by briefly introducing 
the smoothing results on (3) detailed in Section 3, then describe our main algorithm. 

2.1. Smoothness of f(X). In Section 3, we will show that the function / has a Lipschitz continuous gra- 
dient w.r.t. the Frobenius norm, i.e. 



\\Vf(X)-Vf(Y)\\<L\\X-Y\\ F 



with constant L satisfying 



n 



L<C k - 



(4) 



where C k > depends only on k and is bounded whenever k > 3. We will see in Section 3 that this bound 
is quite conservative and that much better regularity is achieved when the spectrum of X is well-behaved 
(see Theorem 3.8). 

2.2. Gradient variance. Section 3 also shows that the function / is differentiable. Let <p be a leading 
eigenvector of the matrix X + ^Zi a zJ where 



'7-0 



argmax A max 

i=l,...,k 



X + -ZiZ { 



T 



n 



We will see that iq is unique with probability one. We have 



Vf(X) = E 



Tl 



and E 



v/P0| 



< i . 



(5) 



Therefore the variance of the stochastic gradient oracle 
Section 3 that this bound is often quite conservative. 



a? 1 



is bounded by one. Once again, we will see in 



2.3. Stochastic algorithm. Given an unbiased estimator for V/ with unit variance, the optimal algorithm 
for stochastic optimization derived in [Lan, 2009] will produce a matrix Xjv such that 



E[f(x N ) - n < 



ALDr 



+ 



4D, 



(6) 



N 2 ' ^/N^ 

after N iterations [Lan, 2009, Corollary 1], where L < C k n/e is the Lipschitz constant of V/ discussed in 
the previous section and q is the number of sample matrices </>^> T averaged in approximating the gradient. 
Once again, we write Dq the diameter of the set Q (see below for a precise definition) and pq, which 
appears in Table 1, the cost of projecting a matrix X G S„ on the set Q. 

Setting N = 2D Q ^/n/e and q = max{l, D Q /(e^/n)} will then ensure E[f(X N ) - /*] < 5e. We 
compare in Table 1 the computational cost of the smooth stochastic algorithm in [Lan, 2009, Corollary 1] 
in this setting with that of the smoothing technique in [Nesterov, 2007a] and the nonsmooth stochastic 
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Algorithmic complexity 


Num. of Iterations 


Cost per Iteration 


Nonsmooth 

Stochastic smoothing 
Deterministic Smoothing 


°(*) 

Q ^Dqt/E\ 


0(pq + n 2 logn) 
(p Q + max 1 1, j^L \ n 2 log nj 
0{pq + n 3 ) 



Table 1. Worst-case computational cost of the smooth stochastic algorithm detailed here, 
the smoothing technique in [Nesterov, 2007a] and the nonsmooth subgradient descent 
method. 



averaging method. Recall that the cost of computing one leading eigenvector of X + vv T is of order 
0{n 2 log n) while that of forming the matrix exponential exp(X) is 0(n 3 ) [Moler and Van Loan, 2003]. 

Table 1 shows a clear tradeoff in this group of algorithms between the number of iterations and the cost 
of each iteration. In certain regimes for (n, e), the total worst-case complexity of the smooth stochastic 
algorithm is lower than that of both smooth and nonsmooth methods. This is the case for instance when 

c\ max < 1, — % \ n 2 log n < po < C2n 5 ^ 2 \/\og n 

for some absolute constants c\ , c-i > 0. In practice of course, the constants in front of all these estimates 
can be quite large and the key contribution of the algorithm detailed here is to preserve some of the benefits 
of smooth accelerated methods (e.g. fewer iterations), while requiring a much lower computational (and 
memory) cost per iteration by exploiting the very specific structure of the A max (X) function. 

3. Efficient Stochastic smoothing 

In this section, we show how to regularize the function A max (X) using stochastic smoothing arguments. 
We begin by recalling a classical argument about Gaussian regularization; we then improve smoothing 
performance by exploiting some explicit structural results on the spectrum of rank one updates of symmetric 
matrices. 

3.1. Gaussian smoothing. We first recall a standard result on Gaussian smoothing which does not exploit 
any structural information on the function A max (X) except its Lipschitz continuity. 

Lemma 3.1. Suppose f : R n — > R is Lipschitz. continuous with constant \x with respect to the Euclidean 
norm. The function g such that 

g (x) = E[f(x + ez)] 
where z ~ AA(0, I n ) and e > 0, has a Lipschitz continuous gradient with 

\\Vg( X )-Vg(y)\\<h^t\\ x .yl 

e 

Proof. See Nesterov [2011] for a short proof and applications in gradient-free optimization. ■ 
Let us consider the function 

E[A max (X + (e/Vn)U)] , 

where U G S n is a symmetric matrix with standard normal upper triangle coefficients. Using again Jensen's 
inequality, the fact that A max (X) is 1 -Lipschitz with respect to the spectral norm and bounds on the largest 
eigenvalue of U (which follow easily from either Trotter [1984] or Davidson and Szarek [2001]), we see 
that this function is an e-approximation of A max (X). 
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The lemma above shows that it has a Lipschitz continuous gradient with constant bounded by O (n 3 / 2 /e) . 
This approach was used e.g. in [d' Aspremont, 2008] to reduce the cost per iteration of a smooth optimization 
algorithm with approximate gradient, and by [Nesterov, 2011] to derive explicit complexity bounds on 
gradient free optimization methods. 

3.2. Gradient smoothness. We recall the following classical result (which can be derived from results 
in [Kato, 1995] and is proved in the appendix for the sake of completeness) showing that the gradient of 
Amax(^) is smooth when the largest eigenvalue of X has multiplicity one, with (local) Lipschitz constant 
controlled by the spectral gap. 

Theorem 3.2. Suppose X G S n and call {Aj(X)}" =1 the decreasingly ordered eigenvalues of X. Suppose 
A m ax(^)> the largest eigenvalue of X, has multiplicity one. Let Y be a symmetric matrix with ||1^||f = 1 
and call 

d 2 X max (X + tY) 
9(X,Y) = bm - 2 . 

Then the local Lipschitz, constant of the gradient is given by 

||VA max (X)|| L = sup g{X,Y) = \- 1 (7) 

This result shows that if we want to produce smooth approximations of the function A max (X) using 
random perturbations, we need these perturbations to increase the spectral gap by a sufficient margin. We 
will see below that, up to a small trick, random rank one Gaussian perturbations of the matrix X will suffice 
to achieve this goal. 

3.3. Rank one updates. For X G S n , we call Aefi" the spectrum of the matrix X, in decreasing algebraic 
order. Whenever v ^ is not an eigenvector of X, the leading eigenvalue l\ of the matrix X + (e/n)vv T , 
with e > 0, is always strictly larger than Ai [see Golub and Van Loan, 1990, §8.5.3] and we write l\ = Ai+i. 

We assume without loss of generality that X is diagonal. If X were not diagonal, we could simply 
rotate v. The variable t is the unique positive solution of the secular equation 

W e t ££(Ai-A<) + * 

where Vj are the coefficients of the vector v. We plot the function s for a sample matrix X in Figure 1. 

Equation (8) implies that we have almost explicit expressions for the eigenvalue decomposition of the 
matrix 

X + -vv T 
n 

where v € W 1 and e > 0. Having assumed that X is diagonal. Golub and Van Loan [1990, Th. 8.5.3] also 
shows that if Vj 7^ for i = 1, . . . , n and e > 0, then t > and the eigenvalues of X and X + (e/n)vv T are 
interlaced, i.e. 

X n (X) < X n (X + -vv T ) <...< A max (X) < A max (X + -vv T ). 

n n 



By construction, the function 



"(0 = --^ 
w e t 



is an upper bound on s(t) on the interval (0, 00). This means that the positive root of s + (t) is a lower bound 
on the positive root t* of s(t) and we get 



n 

5 



-4 -2 2 4 6 

Ai(X) + t 



FIGURE 1. Plot of s(t) versus X±(X) + t. The matrix has dimension four and its spectrum 
is here {—2, —2, 0, 1}. The three leading eigenvalues of X + evv T are the roots of s(t), the 
fourth eigenvalue is at -2. 



Using interlacing, we have 

X 2 (X + -vv T ) < Xi(X) < XUX) + t* = XUX + -vv T ). 
n n 

This gives a lower bound on the spectral gap of the perturbed matrix 

2 

^<t* < X X {X + -vv T ) - X 2 {X + -vv T ) , (9) 
n n n 

which will allow us to control the smoothness of V/(X). 

3.4. Low rank Gaussian smoothing. We now come back to the objective function of problem (3), written 

f(X) 4 E 



max A max [X + -z { zj 
i=i,...,k V n 

where Z{ are i.i.d J\f(0, I n ) and k > is a small constant. We first show that we can differentiate under the 
expectation in the definition of f(X). 

Lemma 3.3. Let X\ + T be the largest eigenvalue of the matrix X + (e/n)zz T , where X is a given deter- 
ministic matrix and z ~ A/"(0, I n ). Then the random variable T has a density. 

Proof. As usual, we call {Aj}" =1 the decreasingly ordered eigenvalues of X and assume here that 
Amax(^) has multiplicity I < n (if I = n there is nothing to show). By rotational invariance of the standard 
Gaussian distribution, we can and do assume that X is diagonal. As we have seen before, T is therefore the 
positive root of the equation 

ir^2 2 n 2 

= s (T)- n ^= lZl ^ Zf 



and note that s(t) is increasing in t. Hence, 

P(T >t) = P{s(T) > s(t)) = P(0 > s{t)) 

'2 




Pt(u)du = I(t) . 



where p t is the density of the random variable 

z? \ - zj 



l^i=l ■ 



If the integral I(t) can be differentiated under the integral sign, then we can differentiate P(T > t) and we 
will have established the existence of a density for T and hence for Ai + T. Now pt(x) is a very smooth 
function of both t and x. Indeed, it is a convolution of n — I densities that are smooth in t and x. Recall that 
if X has density / and t > 0, X/t has density tf(t-). Recall also that a random variable with xf distribution 
has density 

f l {x) = ^j Y) x l ' 2 - 1 eM-x/2). 

So it is clear that for any k, any t > 0, and any a > 0, (t + a)fi((t + a)x) is C°° in t. Applying this result 
in connection to [Durrett, 2010, Th. A.5.1], we see that Yt has a density which is a smooth function of t > 0. 
Indeed, it is C°° on (0, oo). Moreover, it is easy to see that the conditions of [Durrett, 2010, Th. A.5.1] are 
satisfied for p t , which guarantees that we can differentiate under the integral sign. This shows that for any 
t > 0, the function g such that g(t) = P(T > t) is differentiable in t. It is also clear that P(T = 0) is 0, so 
this distribution has no atoms at 0. We conclude that T has a density on (0, oo). ■ 

We then directly obtain the following corollaries. The first shows that two perturbed eigenvalues obtained 
from independent rank one perturbations are different with probability one. 

Corollary 3.4. Suppose = A max (X + (e/n)zi&[) and l± : 2 = A max (X + (e/n^zj), where z\ and Z2 
are independent with distribution A/"(0, I n ). Then l\ \ ^ l\ % with probability one. 

Proof. Follows from Lemma 3.3 since Zx,i and are two independent draws from a distribution with a 
density on (0, oo) and P(l 1A = 0) = P(l 1>2 = 0) = 0. ■ 

The second corollary shows that the maximum of (independent) perturbed eigenvalues is differentiable 
with probability one. 

Corollary 3.5. Suppose = A max (X + {e/n)zizj), where Z{ are i.i.d. with distribution N(0,I n ) for 
i = 1, . . . , k. Then the mapping X — > maxj = i ) ^ In is differentiable with probability one. 

Proof. This corollary follows from the previous one and the fact that if g and h are two differentiable 
functions, then max(g, h) is differentiable at all points x such that g(x) ^ h(x). Indeed, in that case 
[max(5, h)]'(x) = g / (x)l g ( x )>h(x) + h'(x)l g{x)<h{x) . m 

We now use these preliminary results to produce a bound on the Lipschitz constant of the gradient of 
f(X) defined above. 

Proposition 3.6. Let {zi}\ =l be i.i.d. M(0, I n ). The function f(X) such that 



f(X) = E 



max A max (X + {e/n)Ziz\ ) 

;=l,...,fe 



is smooth and the Lipschitz constant of its gradient w.r.t. the Frobenius norm is bounded by 

n 



Ik 

L < Ch— where Cu = — =-; 



when k > 3. 



Proof. Writing Zj the first coordinate of the vector Z( and combining Theorem 3.2, Corollary 3.5 and the 
spectral gap bound in (9) shows 

1 



IVA max 



n 

— mm 

2e i=i,....k -l a 





n k 


< E 





where xf is X 2 distributed with k d.f. The fact that 



E 



2 fc / 2 r(fe/2) 



r(|-i) 



whenever k > 3 yields the desired result, since T(x + 1) = xT(x). \ 



Note that the bound above is a bit coarse; numerical simulations show that for independent Af(0, 1) random 
variables {zj}? =1 , 

-E[l/max{z?,z|,z|}] =.75... 

while C3 = 2.12..., for example. We could of course use the density of the minimum above to get a more 
accurate bound but Ck would not have a simple closed form then. 

3.5. Gradient variance. In this section, we will bound the variance of the stochastic gradient oracle ap- 
proximating V/. Let us call 

g(X, z) = A max (X + -zz T ) . 

n 

Because of the rotational invariance of both A max (-) and of the Gaussian distribution, we can assume without 
loss of generality that X is diagonal and that its largest eigenvalue has multiplicity I. 

Lemma 3.7. Suppose w.l.o.g. that the matrix X G S n is diagonal, and z% ~ AA(0, I n ), the gradient of 



f{X) = E 



max A max pT + (e/n)^ ) 

i=l,...,k 



is also diagonal. It is given by 



V/(X) = E[<Mf ] 

T 



where 0j o is the leading eigenvector of the matrix X + f;Zi zf Q , and 



10' 

i = argmaxA max [X + -ZizJ 
i=l,...,k v n 



We have 

E [Ui <f>Z 
where Tr (V/(X)) = 1 by construction. 



mjl}\\ 2 F ] 



l-Tr(V/(X) 2 ) <1, 



(10) 



Proof. As above, z ~ M(0, I n ) means that z is not an eigenvector of X with probability one. Call Xi(X) 
the eigenvalues of X in decreasing order. This means that g(X, z) has multiplicity one and we call <j)(X, z) 
the corresponding eigenvector. We know that Vg(X, z) = <fi(X, z)(j)(X, z) T with 



where c > is a normalizing factor. Recall that g(X, z) is the largest root of x(A) = 0> where 

v^i 2 n 2 

Call s a vector of ±1 and z[s] = s o z, i.e z[s]j = SjZj for i = 1, . . . , n. The secular equation above depends 
only on zf, hence l\(X, z) = g(X, z) also depends only on zf and we have, for any z and s, 

Xi{X + -zz T ) = Ai(X + -z{ 5 ]z[s] T ), 
n n 

which means 

g(X,z)=g(X, 5 oz)=g(X,z[ 5 }). 

Let us call 

M(X,z) = cp(X,z)(j)(X,z) T . 

We use an invariance argument to show that the symmetric matrix V/(X) = E[M(X, z)] is in fact diagonal. 
We write A = Vf(X) in what follows to simplify notations. Take a vector z, change its i-th coordinate to 
— Zj and call $W the corresponding s sign vector. The i th coordinate of cf> is changed to its opposite, but all 
the other coordinates remain the same, while g(X, z) = g(X,s® o z). This means that 

M id (X, z) = -Mij(X,5 {i) o z) , when % / j. 
The coordinatewise product of «W and z has the same law as z, i.e. z = o z, so 

Mij(X,z) =Mi d (X,s®oz), 

and we conclude from the first equation that E*[Mij(X, z)\ = — E[Mjj(X,sW o zj\ and from the second 
equation that E[M id (X, z)} = B[Mij(X,s^ o z)], hence 

A,j = E[M y (X, z)] = - E[Mij(X, *)] = when i ^ j, 

and E[M] is diagonal, as announced. We now focus on the variance E[||<^> T — E[</></> T ] \\ 2 F ]- We can rewrite 
this expression as 

|| # T _ E[# T]||2 = Tr (# T _ E[# T ])2 

Using the fact that <^ T = 1, and E[</>0 T ] = A we see that 

E[Tr(^ T - E[# T ]) 2 ] = Tr^ - Trvl 2 = 1 - Tr A 2 < 1 
Furthermore, recall that A diagonal means Tr A 2 = Y17=i anc ^ Ym=i -^M = 1 with A^i > 0. ■ 

Simply using the fact that <pi is an eigenvector, we have of course 

||^ o <-E[^ o 0f o ]|| 2 .<4 (11) 

which means that the gradient will naturally satisfy the "light-tail" condition A2 in [Lan, 2009] for a 2 = 4. 
The bound in (10) together with the proof above show that when the spectral gaps \i(X) — Xi(X) are large, 
the diagonal of Vf(X) is approximately sparse. In that scenario, Tr(V f(X) 2 ) is close to Tr(V f(X)), 
hence close to one, and the variance of the gradient oracle is small. 
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3.6. A phase transition. We are here investigating the properties of a random rank one perturbation of a 
deterministic matrix X, specifically X(e) = X + (ej ^/n)zz T , where z ~ A/"(0, I n ). As we will see, the 
bounds we obtained above are quite conservative and the Lipschitz constant of the gradient is in fact much 
lower than n/e when the spectrum of X is well-behaved (in a sense that will be made clear later). 

In particular, we will observe that there is a phase transition phenomenon in e. Let us call T = 
A m ax(AT(e)) — A max (AT). If the perturbation scale e is small, T is of order 1/n (the worst-case bound 
we obtained above). If e is large, T is of order one. And if e has a critical value, then T is Op(l/ \fn)- 

The main idea behind the results we present below is the following. We are looking for the zeros of a 
certain random function, which can be seen as a perturbation of a deterministic function. Hence, it is natural 
to use ideas used in asymptotic root finding problems [see Miller, 2006, pp. 36-43], to expand the solution 
in powers of the size of the perturbation. We note that a similar idea was used in [Nadler, 2008], which 
focused on a different random matrix problem. We have the following theorem. 

Theorem 3.8 (Phase transition for the largest eigenvalue: rank one perturbation). Let X be a symmetric 
matrix with eigenvalues \\ > A2 > . . . > A n . Suppose X\ has multiplicity I and the gap between Ai and 
A/+i, 7, stays bounded away from 0. (Implicitly everything can change with n; I is held bounded.). Call 
Ai — Aj = 7 + 5i,for i > I. Consider the matrix 

X(e) =X + -zz T , 
n 

where z is a vector with i.i.d M(0, 1) entries. Call l\{e) the largest eigenvalue of the perturbed matrix X(e). 
We assume that ex 1. Define eo by 

1 1 A 1 



We note that cq is actually a function of n, but we do not write it explicitly to simplify notations. We also 
assume that there exists C > 0, independent ofn, such that 



7 2 > nf^ (7 + ^-) 2 > 



and that n is going to infinity. Call, for i.i.d M(0, 1) random variables {zj}™ =1 , 

1 n z 2 - 1 

hi, 7 + Sj 



3=1+1 
1 A A 



and Yli=i z i = X? a X? random variable with I degrees of freedom, independent of £1 (note that the 
estimates of the size and £1 are key in all the results that follow). We have the following three situations: 

(1) I/O < e < eo, i.e 6q — e stays bounded away from when n grows, 

Wi Wo ( 1 

/ 1 (e) = A 1 + ^ + ^ + Op - 

where 

Wi = - X \. and W 2 = W?£i. 
1/e - l/e 



(2) Ife = e , 



l 1 (e) = X 1 + ] ^ + Op(- 
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where 

£1 + a/£ + 4 X? d 



Wi 



2Ci 

(3) Ife> eo, i.e e — eo stays bounded away from when n grows, call to > 0, the solution of 

i i A i 



Note that to < (1 — l/n)e. Then 



e n • m 1 *o + 7 + <5j 



Here, 



where 



j z \ x Wi ^ / 1 

/i (e = Ai + t + ^= + Op - 



vr 1 = iM 

1 C(*>) ' 

V™ *o + 7 + Oi 



j=H-i 
1 A I 



C(<d) = ~ JZ u , .:. F\2 = 0(1) 



;1 + 7 + h) 2 



The proof of this theorem is in the Appendix. The phase transition can be further explored in the situation 
where e — eo is infinitesimal in n but not exactly 0. 

We are especially concerned in this paper with random variables of the type 

max A max (X + (e/n)zizf) - A max (X) 
i=l,...,fc 

for i.i.d Zi's. The previous theorem gives us an idea of the scale of this difference, which clearly depends 
on e and the whole spectrum of X. It is also clear that taking a max over finitely many k's does not change 
anything to the previous result as far as scale is concerned. The previous theorem shows that our uniform 
bound on the inverse of the gap cannot be improved. However, in many situations, the gap is much greater 
than 1/n and the worst case bound on the Lipschitz constant of f(X) is very conservative. 

4. Stochastic composite optimization 

In this section, we will develop a variant of the algorithm in [Lan, 2009] which allows for adaptive 
(monotonic) scaling of the step size parameter. For the sake of completeness, we first recall the principal 
definitions in [Lan, 2009], adopting the same notation, with only a few minor modifications to allow the full 
problem to be stochastic. We focus on the following optimization problem 

mm*(a;) := f(x) + h(x), (12) 
xeQ 

where Q C W 1 is a compact convex set. We let || • || be a norm and write || • ||* the dual norm. We assume 
that the functions f(x) and h{x) are defined by 

f(x) = B[f(x,0] and h{x) = E[h(x,£)], 

for some random variable £ G M. d , we write ^>(x, £) := f(x, £) + h(x, £). 

We also assume that £) is convex for any £ 6 M. d , that ^(x, £j) — ^f(x) > a.s., and that the function 
f(x) is convex with a Lipschitz continuous gradient 

l|V/(x)-V/(y)||, <L\\x-y\\, for all x, y e Q, 

n 



and that h(x) is a convex Lipschitz continuous function with 

\h(x) — h(y)\ < M\\x — y\\, for all x,y € Q. 
Furthermore, we assume that we observe a subgradient of \P through a stochastic oracle G(x, £), satisfying 

E[G{x,Z)]=g{x)ed9{x), (13) 
E[\\G(x,0-9(x)\\l]<a 2 . (14) 
We let uj{x) be a distance generating function, i.e. a function such that 

Q° = I x G Q : 3j/ G R p , x G argminfy 7 ^ + w(u)] > 
[ ueQ J 

is a convex set. We assume that ui(x) is strongly convex on Q° with modulus a with respect to the norm 
|| • || , which means 

(y — x) T (Voj(y) — Vw(x)) > a\\y — x\\ 2 , x, y £ Q°. 
We then define a prox-function V(x, y) on Q° x Q as follows 

y) = u(y) - [u(x) + Vu(x) T {y - x)}, (15) 

which is nonnegative and strongly convex with modulus a with respect to the norm || • ||. The prox-mapping 
associated to V is then defined as 

p x'"(y) = argmin{y T (z - x) + V(x, z)}. (16) 

This prox-mapping can be rewritten 

P® ,U3 (y) = &rgmm{z T (y - Vuj(x)) +u(z)}, 

and the strong convexity of w(-) means that Pj (•) is Lipschitz continuous with respect to the norm || • || 
with modulus 1/a (see Nemirovski [2004] or [Hiriart-Urruty and Lemarechal, 1993, Vol. II, Th. 4.2.1]). 
Finally, we define the cu diameter of the set Q as 



^1/2 

zGQ zeQ 

finally, we let 



Duj,q = (maxw(z) -minu;(z)) ' , (17) 

2SQ zeQ 



a; w = argminu^x), 

which satisfies 

7}\\x-x w \\ 2 < V(x w ,x) < w(x)-x(x w ) < Dl Q , for all x G Q. 

4.1. Stochastic composite optimization for semidefinite optimization. We can use the results of Sec- 
tion 3 to derive explicit performance bounds on the algorithm in [Lan, 2009, §3] for problem (3). If we 
define the stochastic gradient oracle 

i I 

G(X,z) = -Y,MT (18) 
q i=i 

where 4> be a leading eigenvector of the matrix X + ^Zi zJ Q where 

io = argmaxA max [X + -ZizJ) , 
i=i,...,k \ n J 

where Zi are i.i.d. Gaussian vectors Zj, ~ A/"(0, I n ) and k > is a small constant (typically 3). [Lan, 2009, 
Corollary 1] implies the following result on the complexity of solving (3) using the AC-SA algorithm in 
[Lan, 2009, §3]. 
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Proposition 4.1. Let N > 0, and write f* the optimal value of problem (3). Suppose that the sequences 
X t , X™ d , X^ 9 are computed as in [Lan, 2009, Corollary 1] using the stochastic gradient oracle in (18). 
After N iterations of the AC-SA algorithm in [Lan, 2009, §37, we have 



4nV2C k D 2 , r A\f2D r , t 



Proof. Using the bound on the variance of the stochastic oracle G(X, z), we know that G satisfies (13) 
with a 2 = l/q. Section 3 also shows that the Lipschitz constant of the gradient is bounded by Cy-n/e. If we 
pick || • as the prox function, [Lan, 2009, Corollary 1] yields the desired result. ■ 

Setting N = 2DQy/n/e and q = m&y:{l, Dcj/(e^/n)} in the convergence bound above will then ensure 
E[/(^at) — /*] = O(e). In the section that follows, we detail a version of the AC-SA algorithm with 
adaptive (but monotonically decreasing) step-size scaling parameter. 

4.2. Stochastic composite optimization with line search. The algorithm in [Lan, 2009, §3] uses worst 
case values of the Lipschitz constant L and of the gradient's quadratic variation a 2 to determine step sizes 
at each iteration. In practice, this is a conservative strategy and slows down iterations in regions where 
the function is smoother. In the deterministic case, adaptive versions of the optimal first-order algorithm 
in [Nesterov, 1983] have been developed by Nesterov [2007b] among others. These algorithms run a few 
line search steps at each iterations to determine the optimal step size while guaranteeing convergence. The 
algorithm in [Lan, 2009] is a generalization of the first-order methods in [Nesterov, 1983, 2003] and, in 
what follows, we adapt the line search steps in Nesterov [2007b] to the stochastic algorithm of [Lan, 2009, 
§3]. Here, we will study the convergence properties of an adaptive variant of the algorithm for stochastic 
composite optimization in [Lan, 2009, §3]. 

Algorithm 1 Adaptive algorithm for stochastic composite optimization. 

Input: An initial point x ag = x\ = x w 6 W 1 , an iteration counter t = 1, the number of iterations N , line 
search parameters -y mm ^max ^ -y > o, with ^ d < 1. 
1: Set 7 = 7 max . 



for t = 1 to N do 

Define xf d = j^x t + j^x a t 9 

Call the stochastic gradient oracle to get G(xY ld , £t). 

repeat 

Set 7t = ^. 

Compute the prox mapping x t +i = P Xt (-ftG(xY ld , &)). 

o et ag _ _2_ , t^l ag 

oei x t+1 — t+1 &t+l > t+l X 



9: until *(x^ l9 Ct+i) < ^(xT d ^t) + (G(xT d ^t),x^ 1 -xr d ) + ^\\xti 1 -xT d \\ 2 + 2M\\x^ +1 

x rnd|| or y < yron_ jf ex j t con djtion fails, set 7 = 77^ and go back to step 5. 
10: Set 7 = max{7 min ,7}. 
11: end for 

Output: Apointx^ +1 . 



In this section, we first modify the convergence lemma in [Lan, 2009, Lemma 5] to adapt it to the line 
search strategy detailed in Algorithm 1 . Note that a particularity of our method is that testing the line search 
exit condition uses two oracle calls, the current one in £ t and the next one in £t+i. This last oracle call is of 
course recycled at the next iteration. 
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Lemma 4.2. Assume that £t) jj convex for any given sample of the r.v. £t. Le? xt,x™ d , x^ 9 be computed 
as in Algorithm 1. Suppose also that 7 and these points satisfy the line search exit condition in line 9, i.e. 

a 9 „md 1 1 2 I n a a II ag mrf 1 1 



,7V 



then, for every x in the feasible set, we have 

4M 2 7t 2 



A7t[*K^i,6+i) - + F(x t+ i,x) < (A - l) 7t [*(x^,6) - + ^(x t ,x) + 

+7t(*(x,^)-*(x)) 



a 



Proof. As in [Lan, 2009, Lemma 5], we write dt = x t +\ — x t and use the parameter fi t = (t + l)/2 for 
step sizes so that x"?j — x™ d = dt/ fit- If the current iterates satisfy the line search exit condition, the fact 
that a||dt|| 2 /2 < V(xt,xt+i) by construction yields 

A7^«ii,6+i) < Mn^M) + (G^,^),^! - *? d )] + f ll^ll 2 + 2 7t MR|| 

< Mtm*? d ,(it) + (G(xT d ,^),x a t ii - xT d )} + V(x t ,x t+1 ) - j\\dt\\ 2 + 2rf t M\\dt 
Using the convexity of ^(-, £ f ) we then get 

Pt-ytM^.tt) + (G(xT d ,^),x^ 1 - xT d )} 

= (fit - lhtM^Zt) + (G{xt d , it), x a t 9 ~x? d )} +imx? d ,it) + (G(xT d ,it),x t+ i -xT d )} 
< (fit - i)jMx a t 9 ,b) + 7t[*(*r d ,6) + (G(xT d ,it),x t+l - xT d )}. 

Combining these last two results and using the fact that bu — au 2 /2 < b 2 / (2a) whenever a > 0, we obtain 

A7^«ii,£ m ) < (fit - ihMx a t \it) + it[^(xT d ^t) + (G(xr d ,^),x t+l - xT d }] 

+V(x t ,x t+ i) - ^|K|| 2 + 2 lt M\\d t \\ 

< (fit - l) lt ^(x a t 9 M) + ltMxT d ,it) + (G(xT d ,tt),x t+1 - xT d )] 
A^}M 2 

+V(x t ,x t+1 ) + -ii . 

a 

For any x in the feasible set, we can then use the properties of the prox mapping detailed in [Lan, 2009, 
Lemma 1], withp(-) = 'yt{G(xf ld , £t), • — x^) together with the convexity of and the definition of 

xt+i in Algorithm 1 to show that 

7t[*(xT d ,& + (G(xT d ,tt),x t+1 - xT d )\ + V(x t ,x t+1 ) 

< lt ^(xT d , 6) + lt(G(xT d , 6), x - xT d ) + V(xt, x) - V(x t+U x) 

< j t ^(x,Ct) + V(x t ,x) - V(x t +i,x), 

and combining these last results shows that 

4-y 2 .M 2 

ft7t*K+i.&+i) < (^-l)lt^{x a t 9 ^t)+l^{x,it) + V(x u x)-V(xt + ux)+ '■ 



a 

and subtracting f3 t ^/t^(x) from both sides yields the desired result. ■ 



We are now ready to prove the main convergence result, adapted from [Lan, 2009, Corollary 1]. We 
simply stitch together the convergence results we obtained in Lemma 4.2 for the line search phase of the 
algorithm, with that of [Lan, 2009, Lemma 5] for the second phase where 7 = j min . Note that the step size 
is still increasing in the second phase of the algorithm because jt = ^ mm (t + l)/2. 
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Proposition 4.3. Let N > 0, and write \& (x* ) ?/2e optimal value of problem (12). Suppose that the sequences 
Xt , x™ d , x^ 9 are computed as in Algorithm 1, with line search parameter 7 initially set to 7 = ^ max with 

^ max < \/6aA^Q in j f_^_ maxl (2 0) 

7 -(^ + 2)3/2(4^ + ^)1/2 7 mm l2L' 7 J l2Uj 

w/f/j 7^ < 1. A/ter ./V iterations of Algorithm 1, we have 



E[*W +1 )-**]<^ 2 + ^ T ^ 



— E 



2(4M 2 + a 2 ) \ , 
X It 



a 

t=l 



(21) 



and a simpler, but coarser bound is given by 



, max 



E [•«?„> " **] < ^ + 2^rWgg±g ( 2_ P (T 7 . tf) + 1 - p{Ty,N) ) . (22) 



iV 2 ' v /]y \7 mm ' 

where p(T 7 , JV) = (T 7 + 2) 3 /(A r + 2) 3 . 
Proof. Lemma 4.2 applied at x* shows 

A 7 ^«+i,£m) - #(x*)] + F(x t+1 ,x*) < (A - l)7t[*(^,e*) " *0O] + V(x t ,x*) + 



4M 2 7 2 



hence, having assumed $(x, — VP (a;) > a.s., 

(A+i- 1)7*^(^,6+1)-^^*)] < /3t7t[*Wi,et)-*(^)] 

< ( A -i) 7t [*(x^em)-*(0] + ^ Lk 

a 

+7 t (*(x* 5 ^)-*(x*)) + F(x t ,x)-y(x m! x) 
whenever the line search successfully terminates, with the last term satisfying 

max f.L 1 1 \ 

E[7*(*(^,6) " *(**))] < 2 E[*(x*,C t ) - *(x*)] = 

using again \P(x*, £t) — \P(x*) > a.s.. When the line search fails jt = j mm (t + l)/2 is deterministic and 
[Lan, 2009, Lem. 5 & Th. 2] show that 

- lhtMxftj) - *(x*)] < (A - l) 7 t[^(4 ff ) - V(x*)] + V(x t ,x*) - V(x t+1 ,x*) + A(x*) 

where 

a / * w , A * v , 2(4.M 2 + ||^|| 2 ) 7 2 

A(x)<7t(5 t ,x -x t )H . 

a 
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with 5 t = G(xf ,£t) -g(xt) and j t (5 t ,x* — x t ) < jt\\5t\\*\\x* -%t\\- We ca H t = T 1 the iteration where 
the line search first fails. Combining these last results, using f3\ = 1, we obtain 



(Av + i-l)77vE[*(*^ +1 )-*0r*)] < D2,q+5> 



t=i 



a 



+C9r 7 -l)7r r E[*(x^)-*( a ^ 1 6 

iV 



< D l,Q + E E 



jt(St,x* - x t ) + 
4M 2 7 2 



2(4_M 2 + ||,5 t || 2 )^ 



a 



t=l 



a 



N 



2(4M 2 + ||^|| 2 ) 7t 2 



2(47Vi 2 + <r 2 ) 2 



ft: 



t=l 



because E[^(x^)-*(x° 9 ,^)] =0. Using the fact that + l ) q < {N + q) q+1 / {q + 1) for q = 1,2 

then yields the coarser bound. ■ 

We observe that, as in [Nesterov, 2007b], allowing a line search slightly increases the complexity bound, 
by a factor 

/ 7YICLX \ 

- P (T 7 ,iv) + i-p(r 7 ,iv) , 



T 

where p(Ty,N) = (T 7 + 2) 3 /(N + 2) 3 . We will see however that overall numerical performance can 
significantly improve because the algorithm takes longer steps. 

5. Extensions 

In this section, we discuss possible extensions of the stochastic regularization techniques, their efficiency 
and regularity. 

5.1. GUE smoothing. We have chosen to analyze the rank one perturbation because of its numerical effi- 
ciency and mathematical simplicity. However, many other random smoothing algorithms are possible and 
modern random matrix theory offers tools to understand their properties. We expect that some of them will 
lead to better worst case bounds than the rank one perturbation methods we have considered here. 

A case in point is the following. Consider a matrix U from the Gaussian Unitary Ensemble (GUE). 
Matrices from GUE are Hermitian random matrices with complex Gaussian entries, i.i.d JV/c (0, 1) above 
the diagonal and i.i.d TV (0, 1) on the diagonal. Recall that if zc is A/"c (0, 1), z<c = (zi + IZ2) / y/2, where zi 
and Z2 are independent with distribution M (0, 1). 

In what follows, X is a deterministic matrix and U is a random GUE matrix. We assume, without loss of 
generality, that the largest eigenvalue of X is bounded (if not, we can always shift X by a multiple of I„, 
which takes care of the problem). 

A natural smoothing of X mSLX (X) is fcuE{X) = E[A max (X + (e/ y/n)U)], where U is a GUE matrix. 
This type of matrices belong to the so-called "deformed GUE". Johansson [2007] is an important paper in 
this area and contains a result, Theorem 1.12, that is not exactly suited to our problem but quite close, perhaps 
despite the appearances. Before we proceed, we note that showing that fcuE{X) is an e-approximation of 
A m ax(^Q is immediate from standard results on GUE matrices (see Trotter [1984], Davidson and Szarek 
[2001]). 

In a nutshell, random matrix theory indicates that A max (^ + (e/y/n)U) undergoes a phase transition as 
e changes when X is not a multiple of I n . If e is sufficiently large (more details follow), the behavior of 
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A m ax(^ + {(■/ \fn)U) is driven by the GUE component and the spacing between the two largest eigenvalues 
is of order n -2 / 3 . On the other hand, if e is not large enough, we remain essentially in a perturbative regime 
and the spacing between the two largest eigenvalues is larger than n~ 2 / 3 . A very detailed study of the phase 
transition should be possible, too. However, all these results are asymptotic. Non-asymptotic results could 
be obtained (the machinery to obtain results such as Johansson's is non-asymptotic) but would be hard to 
interpret and exploit. We therefore keep this discussion at an informal level. 

Smoothing by a GUE matrix should therefore give a worst case bound on || V/||l of order n 2//3 , which is 
better than the worst case bound of n we have when we smooth with rank one matrices (but requires gener- 
ating 0(n 2 ) random numbers instead of 0(n)). GUE smoothing might therefore improve the performance 
of the algorithm since the cost of generating these random variables is typically dominated by the cost of 
computing a leading eigenvector of the perturbed matrix. 

Let us give a bit more quantitative details. Based on Johansson's work and the solution to a similar 
problem in a different context (El Karoui [2007]), it is clear that the condition for the spacings to be of order 
n -2/3 j s t ^ e following (this result might be available in the literature but we have not found a reference). 
Call F n the spectral distribution of X, i.e the probability distribution that puts mass 1/n at each of the n 
eigenvalues of X. Call w c the solution in (A max (X), oo) of 

dF n (t) 1_ 

K - ty e 2 ■ 

Call Q the class of matrices for which 

liminf [w c - A max (X)] > . 

n— >oo 

Then, looking carefully at Johansson's work, it should be possible to show that: if the sequence of 
matrices X is in Q, then, if X(e) = X + e/y/nU, 

^ 2/3 A max (X(e)) -On ^ ^ 

Pn 



2 / dF n (t) 2 ( f dF n (t) \ 1/3 

a n = w c + e z / and f3 n = e z ' 



where 

Wc — t \J {W c -t) 6 / 

and TW2 is the Tracy -Widom distribution appearing in the study of GUE [see Tracy and Widom, 1994]. 
The same is true for the joint distribution of the k largest eigenvalues, where A: is a fixed integer, and TW2 
is replaced by the corresponding limiting joint distribution for the k largest eigenvalues of a GUE matrix. 

When the matrix X is not in Q, then the top two eigenvalues should have spacing greater than n~ 2//3 . We 
expect that if X has some sufficiently separated eigenvalues with multiplicity higher than one, the spacings 
there are at least n -1 / 2 , by analogy with Capitaine et al. [2009] and Baik et al. [2005]. To quantify what 
"sufficiently separated" means, we could suppose that X is a completion of a (n — ko) x (n — k^) matrix 
Xq which is in Q, to which we add ko eigenvalues A max (X), all equal and greater than A max (Xo), with 
A m ax(^Q greater than and bounded away from w c (Xq). Call F n _k $ the spectral distribution of Xq. Then, 
we should have 

^1/2 Vax(^(e)) - On , fPTTF ^ 

Pn 

where a n = A max (X) + e 2 / £^ and ? n = e (l - e 2 / ^$% ) ^ \ 

The same is true for the ko largest eigenvalues of X(e) and A max (GUEfc oX fc ) is replaced by the corre- 
sponding joint distribution for the ko x ko GUE. 

In light of the integrability problems we had in the rank one perturbation case for the inverse spectral gap 
l/(Zi(X(e)) — l2{X(e))), it is natural to ask whether such problems would arise with a GUE smoothing. 
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For this informal discussion, we limit ourselves to answering this question for the GUE. We recall that the 
joint density of the eigenvalues {Zi,G[/B}f=i of a n x ra GUE matrix is 

C Yl \ l i,GUE - lj,GUE\ 2 ex.p(-lf GUE /2) , 

l<j<j'<n 

where C is a normalizing constant. So we see immediately that 1/{1\,GUE — h,GUE) is integrable in the 
GUE setting. (The formula above is often stated for the unordered eigenvalues of a GUE matrix. The 
functional form of the density is unchanged by ordering, because of the symmetry. The domain of definition 
and the constant change when considering ordered eigenvalues, but this has no bearing on the question of 
integrability.) 

The smoothing could also be done by a matrix from the Gaussian Orthogonal Ensemble (GOE), where 
the entries above the diagonal are i.i.d AA(0, 1) and the entries on the diagonal are i.i.d A/"(0, 2) - the different 
normalization on and off the diagonal yields rotational invariance. We do not know of a result corresponding 
to Johansson's in that case, though we would expect that the behavior of the top eigenvalues is the same as 
described above, with TW2 replace by TWi, the Tracy -Widom distribution appearing in the study of GOE. 
From an algorithmic point of view, the two methods should therefore be equivalent. 

6. Numerical Experiments 

We test the algorithm detailed above on a maximum eigenvalue minimization problem over a hypercube, 
a problem used in approximating sparse eigenvectors [d'Aspremont et al., 2007]. We seek to solve 

minimize A max (^4 + X) 

subject to — p < Xij < p, for i,j = 1, . . . , n 

which is a semidefinite program in the matrix X E S n . Since randomly generated matrices A have highly 
structured spectrum, we use a covariance matrix from the gene expression data set in [Alon et al., 1999] to 
generate the matrix A E S„, varying the number of genes to vary the problem dimension n (we select the n 
genes with the highest variance). We set p = max{diag(yl)}/2 in (23). 

We first compare the performance of Algorithm 1 with that of the corresponding deterministic algo- 
rithm detailed in [Nesterov, 2007a,b], using the accelerated first-order method in [Nesterov, 2007b, §4] after 
smoothing problem (23) as in [Nesterov, 2007a; d'Aspremont et al., 2007]. We set a fixed number of outer 
iterations for Algorithm 1 and record the number of iterations (and eigenvector evaluations, these numbers 
differ because of line search steps) required by the algorithm in [Nesterov, 2007b, §4] to reach the best ob- 
jective value attained by the stochastic method. We set q = 5, k = 3 and the maximum number of iterations 
to 2Qy/n in the stochastic algorithm. To provide a complexity benchmark that is both hardware and imple- 
mentation independent, we record the total number of eigenvectors used by each algorithm to reach a given 
objective value (the matrix exponential thus counts as n eigenvectors). We report these results in Table 2. 



n 


# Iters. (Stoch.) 


# Eigvs. (Stoch.) 


# Iters. (Det.) 


# Eigvs. (Det.) 


100 


200 


6120 


100 


40400 


200 


283 


8565 


100 


81200 


500 


447 


13470 


100 


203000 



Table 2. Number of iterations and total number of eigenvectors computed by Algorithm 1 
(Stoch.) and the algorithm in [Nesterov, 2007b, §4] (Det.) to reach identical objective 
values. 

In both algorithms, the cost of each iteration is dominated by that of computing gradients. The cost of 
each gradient computation in Algorithm 1 is dominated by the cost of computing the leading eigenvector of q 
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perturbed matrices. The cost of each gradient computation in [Nesterov, 2007b, §4] is dominated by the cost 
of computing a matrix exponential. This means that the ratio between these costs grows as Oinj {q log n)). 

In Figure 2 we plot the sequence of line search parameters 7 for the stochastic algorithm together with the 
values of the Lipschitz constant L used in the deterministic smoothing algorithm, when solving problem (23) 
with n = 500. We observe that both algorithms initially make longer steps, then slow down as they get closer 
to the optimum (where the leading eigenvalues are clustered). 



50 100 150 200 250 300 350 400 450 

Iteration 



10"" 



f 




50 100 150 

Iteration 



200 



FIGURE 2. Line search parameters 7 for the stochastic algorithm (left) together with the 
values of the inverse of the Lipschitz constant L used in the deterministic smoothing algo- 
rithm (right). 



7. Appendix 

In this appendix, we recall several useful results related to the algorithm presented here. The first sum- 
marizes the complexity of computing one leading eigenvector of a symmetric matrix (versus computing the 
entire spectrum). We then prove Theorem 3.2 linking the local Lipschitz constant of the gradient and the 
spectral gap. Finally, we detail the proof of the phase transition result in Theorem 3.8 and show how the 
secular equation can be generalized to perturbations of higher rank. 

7.1. Computing one leading eigenvector of a symmetric matrix. The complexity results detailed above 
heavily rely on the fact that extracting one leading eigenvector of a symmetric matrix X € S n can be done 
by computing a few matrix vector products. This simple fact is easy to prove using the power method when 
the eigenvalues of X are well separated, and Krylov subspace methods making full use of the matrix vector 
products converge even faster. However, the problem becomes more delicate when the spectrum of X is 
clustered. The section that follows briefly summarizes how modern numerical methods produce eigenvalue 
decompositions in practice. 

We start by recalling how packages such as LAPACK Anderson et al. [1999] form a full eigenvalue 
(or Schur) decomposition of a symmetric matrix X 6 S n . The algorithm is strikingly stable and, despite 
its 0(n 3 ) complexity, often competitive with more advanced techniques when the matrix X is small. We 
then discuss the problem of approximating one leading eigenpair of X using Krylov subspace methods 
with complexity growing as 0(n 2 log n) with the dimension (or less when the matrix is structured). In 
practice, we will see that the constants in these bounds differ significantly, with the cost of a full eigenvalue 
decompositions (and matrix exponentials) growing as 4n 3 /3 while computing one leading eigenpair has 
cost cn 2 , with c in the hundreds. 
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7.1.1. Full eigenvalue decomposition. Full eigenvalue decompositions are computed by first reducing the 
matrix X to symmetric tridiagonal form using Householder transformations, then diagonalizing the tridi- 
agonal factor using iterative techniques such as the QR or divide and conquer methods for example (see 
[Stewart, 2001, Chap. 3] for an overview). The classical QR algorithm (see [Golub and Van Loan, 1990, 
§8.3]) implicitly relied on power iterations to compute the eigenvalues and eigenvectors of a symmetric 
tridiagonal matrix with complexity 0(n 3 ), however more recent methods such as the MRRR algorithm by 
Dhillon and Parlett [2003] solve this problem with complexity 0(n 2 ). Stalling with the third version of 
LAPACK, the MRRR method is part of the default routine for diagonalizing a symmetric matrix and is 
implemented in the STEGR driver (see Dhillon et al. [2006]). 

Overall, the complexity of forming a full Schur decomposition of a symmetric matrix X G S n is then 
An 3 / 3 flops for the Householder tridiagonalization, followed by 0(n 2 ) flops for the Schur decomposition 
of the tridiagonal matrix using the MRRR algorithm. 

7.1.2. Computing one leading eigenpair. We now give a brief overview of the complexity of computing 
leading eigenpairs using Krylov subspace methods and we refer the reader to [Stewart, 2001, §4.3], [Golub 
and Van Loan, 1990, §8.3, §9.1.1] or Saad [1992] for a more complete discussion. Successful termination 
of a deterministic power or Krylov method can never be guaranteed since in the extreme case where the 
starting vector is orthogonal to the leading eigenspace, the Krylov subspace contains no information about 
leading eigenpairs, so the results that follow are stochastic. [Kuczynski and Wozniakowski, 1992, Th.4.2] 
show that, for any matrix I £ S„ (including matrices with clustered spectrum), starting the algorithm at 
a random u\ picked uniformly over the sphere means the Lanczos decomposition will produce a leading 
eigenpair with relative precision e in 



iterations, with probability at least 1 — 5. This is of course a highly conservative bound and in particular, the 
worst case matrices used to prove it vary with k Lan . 

This means that computing one leading eigenpair of the matrix X requires computing at most A; Lan matrix 
vector products (we can always restart the code in case of failure) plus 4n/c Lan flops. When the matrix is 
dense, each matrix vector product costs n 2 flops, hence the total cost of computing one leading eigenpair 



flops. When the matrix is sparse, the cost of each matrix vector product is O(s) instead of 0(n 2 ), where 
s is the number of nonzero coefficients in X. Idem when the matrix X has rank r < n and an explicit 
factorization is known, in which case each matrix vector product costs 0(nr) which is the cost of two n x r 
matrix vector products, and the complexity of the Lanczos procedure decreases accordingly. 

The numerical package ARPACK by Lehoucq et al. [1998] implements the Lanczos procedure with a 
reverse communication interface allowing the user to efficiently compute the matrix vector product Xuj. 
However, it uses the implicitly shifted QR method instead of the more efficient MRRR algorithm to compute 
the Ritz pairs of the matrix £ 

7.2. Controlling the Hessian of X max (X). Consider the map /o : S n — > R such that fo{X) = A max (X). 
We want to show that its gradient is Lipschitz continuous, when the largest eigenvalue of X has multiplicity 
one and control its constant. To do so, we compute d 2 fo(X + tY)/dt 2 , where \\Y\\p = 1, and Y is 
symmetric. Let us call Ai > A2 > A3 > A n the ordered eigenvalues of X. Very importantly we assume 
that Ai has multiplicity one. If not, it is easy to see that the map we are looking at is continuous but not 
Lipschitz. We refer the reader to [Kato, 1995; Overton and Womersley, 1995; Lewis and Sendov, 2002] for 
a more complete discussion. We have the following theorem. 



k 



Lan 



< 



log(n/S 2 ) 



of Xis 
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Theorem 7.1. Suppose X is an n x n symmetric matrix with decreasingly ordered eigenvalues {Aj}f =1 . 
Call fo(X) = A max (X) and suppose that A max (X) has multiplicity one. Let Y be a symmetric matrix with 
\\Y\\p = 1- Let us call 

d 2 f (X + tY) 

9 ^ Y) = \^ — w — • 

Then we have 

||V/ (X)|U= sup 9 {X,Y) = \ 1 . (24) 

Proof. The strategy is to first exhibit a matrix Y c in S n that will give us the right-hand side of Equation (24) 
as a lower bound. And then we will show that indeed this bound is the best one can do. We will rely heavily 
on the following classical result from the analytic perturbation theory of matrices. We can use [Kato, 1995, 
p.81] to get 

j— ^ 

where fa is an eigenvector corresponding to the eigenvalue Ai and fa is an eigenvector corresponding to the 
eigenvalue Aj. Here we have crucially used the fact that \i(X) has multiplicity one. 
Finding a lower bound for \\Vfo(X) Let O be an orthonormal matrix that transforms the canonical basis 
(ei, . . . , e n ) into the orthonormal basis (fa, . . . , <p n ). In other words, Oe« = fa and hence T fa = ej. Let 
us call Pq the matrix that exchanges e\ and e 2 and send the other gj's to 0. In other words, the 2x2 upper 

1 N 



left block of Pq is the matrix I ^ and Pq is zero everywhere else. Now call 

Note that Y c G S„. Since T fa = e*, we see that Y c fa = fa/y/2, Y c 4> 2 = fa/y/2, and Y c fa = if j > 2. 
Further, ||y c ||J, = Tr Y C T Y C = TrY c 2 = TrOP 2 T /2 = \\P \\ 2 F /2 = 1. Now, $Y c fa = ^||^|| 2 /x/2. 
Hence, 

9 ^ Yc) = 2\ x (X)-\ 2 (X) ' 

and therefore, 

II^WH' 2 I MX) -x 2{ x) ■ 

Finding an upper bound for [|V/o(-X")||x. On the other hand, we clearly have, for j > 2, < l/(Ai(X) — 
Aj(X)) < l/(Xi(X) - X 2 (X)). Therefore, 



n j ^ n 

Since {</>j}? = i form an orthonormal basis, and F is symmetric, 



,J 2 



3=1 



2 

1 II 2 ■ 



As a matter of fact fa^Yfa is just the coefficient of the vector Y T fa = Y fa in its representation in the basis 
of the 0j's. We therefore have 
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Now let us call yij the (i, j)-th entry of the matrix that represents Y in the basis of the 0j's. Since \\Y 



Using the symmetry of Y, we therefore see that 



Now, llr^Hl = £" =1 y?j and {(flY^) 2 = y\ v Hence, 

n 1 — 7~/ 2 1 



J'=2 



We conclude that 



vy G s n , ||y||F = i, g(x,Y)<- 



and therefore 



2A!(X)-A 2 (X) ' 

||V/ (X)|| L = sup g(X,Y)<\ - 1 . 
Since we have matching upper and lower bounds for || V/o(X) ||l, we have established the theorem 



7.3. Phase transition. We prove Theorem 3.8 in this section. 
7.3.1. Preliminaries. Let us call 



91 



l A l 

« ^ t + 7 + <5j 



wo = 1 E 



n t ' n ^— ' t + 7 + Sj 

j=l+i J 



n t 



+ 9l(t) , 



n t 

Recall that l\ = Ai + T is the root of the equation 



i 



+ - E 



n T n ^— ' T + 7 + 5,- 



It is clear that T > (e/n) ^Li A- Mso > h'(t) < on (0, oo), so h is invertible. We note that 



var 



n 



3=1 J 
Z? - 1 



n ■ t + 7 + 



1 



(t + 7 + 5i) 2 



n 7^ 



So the error made when replacing hi by gi when seeking the root of Equation (26) is Op{l/ yfn). 
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Our strategy is to expand T in powers (possibly non-integer) of 1/n. We call t(m) an approximation of 
T to order m. If we can find an approximate solution t(m), such that 

\h(t(m)) - i| = <3 P (n -/3 ) , for some /? , 

we claim that 

\t(m)-T\ = P (n-P) . 

e yj'._ z 2 

This is because h is, at Zj fixed, a Lipschitz function on ( — 3 ~ 1 3 , oo), and its Lipschitz constant is bounded 
below with high-probability on any compact interval of this interval. Hence, we have 

\t{m) -T\ = Ih-^hMm))) - /i _1 (/»(T))| < l)/^ 1 || L |/i(i(m)) - i| = Op^) . 

Note that if we can show that \h'(y)\ > Cn b in a neighborhood of t(m), then we get by the same token 

\h{t{m)) - ^\ = P {n-P) =► \t(m) - T\ = P {n~^) . 
We finally recall that 

l . , l ,A l 

-=g l =- E — T- 
7.3.2. Case e < eo- Recall that the equation defining T is 

l- h( ? ) -—^T + n ^ T + 7 + ^-^^T +/lz(T) - 

In this case, 

1 1 

si(0) = - < - , 

so it is clear that the term — 3 n ^ 3 needs to enter into play to "saturate" the equality. In particular, T is going 
to be of order 1/n. But we can expand it further. 

Let us now expand the last term above, i.e hi(t), in powers of i's. Because hi is uniformly bounded in 
probability for t in a neighborhood of 0, we have 

1 _n _2 1 n v 2 1 ™ 7 2 

z 2 z 2 

So calling Ci = \ TJj=i+i ( 7+ i) 2 ' and ^ = n £j=J+i the equation defining T becomes 



1 v 2 1 1 1 ™ Z 2 - 1 

e n T e n .f-f 7 + 



We see that by taking 



2 1 n v 2 1 
l_j " n n3/ 2 ^ ^ 7 + *j ' 

J=«+l 



with 



we have 



OL\ = 1 _ y , 



/i(t(2)) - /i(T) = P (\/n) . 
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Now, we note that in a neighborhood of 1/n, the derivative of h is bounded below in absolute value and in 
probability by Op(n). Our argument in the previous subsection therefore allows us to conclude that 

T-t(2) = P (\ 

7.3.3. Case e = cq. We can use the same expansion in t as above, but Equation (26) defining T becomes 



- = ^ + 1 + - E ^-T Cl + 0(T 2 ) 



where, as above, 



i=i+i ' J 



1 n z 2 

Ci = - E ' 



Because £1 = Sj=i+i = Op(l), we see that now, T is of order Ify/n. Using the ansatz t(l) 
a I y/n, we see that a should equal (recall that a > 0), 



ei + v^i+wi 

a = — 

2Ci 

Now 

1 „ (\ 



h(t(l)) - - = Op - 



and in a neighborhood of a/y/n, h is Lipschitz with Lipschitz constant bounded away from 0. Hence, as 
argued in 7.3.1 

T=^ + Op(i). 



'n n 

1.3.4. Case e > 6q. Recall that the equation defining T is 



1 v 2 1 1 n z 2 - 1 1 n 

- x < + 1 V A + £ y 



e n T n T + 7 + 5, re ^— ' T + 7 + <5, 
When e > eo, we can find to bounded away from such that 

1 1 ^ 1 



e n . z r- / 1 to + 7 + 

to is furthermore bounded. So T is going to converge to to and the question is to understand how far away 
it is. By writing T = to + rj, after expanding the equation characterizing T around to, we see that we have 

where 



reto y'-t + ^ m - vC{t0) = O{vh 



1 A A - 1 



e(t ) = -J= E ; 1 _i_ x = °Hi) 

and 
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We conclude that, informally, rj = Op(^). Now let us verify it properly. Let us call 

m = to + ifM. 

V«C(*o) 

The expansion above shows that 

- - h(t(l)) = P (l/n) . 

€ 

Because h is Lipschitz with Lipschitz constant bounded below in a neighborhood of to, we conclude as in 
7.3.1 that 

T = t 0+ '^ + O P ( 1 -). 

7.4. On the secular equation and higher-order perturbations. We give an elementary proof of the va- 
lidity of the secular equation, which avoids matrix representations. Though simple and likely well-known, 
the advantage of our derivation is that it extends easily to higher rank perturbation. More precisely, let us 
consider the matrix 



M l = M + U , (27) 
where U is a symmetric matrix. We assume without loss of generality that M is diagonal. We write 



U = X^=i v j v J- We do not require the Vj to be orthogonal and they could also be complex valued in what 



follows. 

Let us call \± > X 2 > . . . > X n the eigenvalues of M and compute the characteristic polynomial of M\ 
and relate it to that of M. We call 

P Ml (A) = det(Mi-AI n ) , 
Pa* (A) = det(M - AI„) , 
M\ = M — AI n . 

Assuming for a moment that A is not an eigenvalue of M, we clearly have Mi — AI n = M\(I n + M^ 1 U). 
We call G(A) the k x k matrix with (i, j) entry vjM^ 1 V{. 
We have 

P Mi (A) = det(M A ) det(I n + M^U) = P M (A) det(I fc + G{\)) , 

since det(I n + AB) = det(Ifc + BA) for rectangular matrices A and B whenever AB is n x n and BA is 
k x k. The previous formula can be used to study the eigenvalues of finite rank perturbations of M, since 
they are the zeros of the characteristic polynomial P^ x . 

Let us focus on the rank one case. Since we assume wlog that M is diagonal, we have, when k = 1, 

n 9 

det(I fc + G(A)) = det(l + v T M^v) = 1 + ^ y- 
We therefore get , when A is not an eigenvalue of M, 



1 Aj — A 



II(A* - A) 



i=l 



" 2 \ 

1+ Erh - (28) 



Aj - A 
i=i 



from which the secular equation follows. From Equation (28), it is also clear that if Aj is an eigenvalue of 
M with multiplicity k, Aj is also an eigenvalue of M\ with multiplicity k — 1. 
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